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Abstract 



We present a parallelizable SSOR preconditioning scheme for Krylov subspace 
iterative solvers which proves to be efficient in lattice QCD applications involving 
■ Wilson fermions. Our preconditioner is based on a locally lexicographic ordering 

of the lattice points. In actual hybrid Monte Carlo applications with the bi- 
conjugate gradient stabilized method BiCGstab, we achieve a gain factor of about 
2 in the number of iterations compared to conventional odd-even preconditioning. 



Whether this translates into similar reductions in run time will depend on the 
parallel computer in use. We discuss implementation issues using the 'Eisenstat- 
trick' and machine specific advantages of the method for the APEIOO/Quadrics 
parallel computer. In a full QCD simulation on a 512-processor Quadrics QH4 
we find a gain in cpu-time of a factor of 1.7 over odd-even preconditioning for a 
24 3 x 40 lattice. 



1 Introduction 



The computation of the effects of fermionic forces (from virtual quark-antiquark cre- 
ation and annihilation processes) onto the vacuum structure of quantum chromodynam- 
ics (QCD) presents a severe bottleneck to the numerical evaluation of this fundamental 
theory of strong interactions. 



1 



The stochastic sampling of the vacuum state by the standard Hybrid Monte Carlo algo- 
rithm (HMC) e. g. implies the continual computation of fermionic Greens functions on a 
stochastic source <fi. In terms of the discretized Dirac operator M, the fermionic Greens 
function (or 'quark propagator') is nothing but the solution of the linear equation 

Mx = 0, (1) 

where M is a huge sparse matrix, of rank r = 3 x 4 x V and V is the volume of the 
underlying 4-dimensional space-time lattice. There are different discretizations of the 
Dirac operator in use. For the study of weak interaction processes, the Wilson fermion 
scheme [|IJ appears to be the most attractive. 

Today, parallel computers have matured and provide sufficient compute power to drive 
simulations of full QCD with Wilson fermions to the level of 100 and more Teraflops- 
hours. This performance allows for sufficient sampling of QCD vacuum configurations 
on reasonably sized lattices. As a consequence there is topical interest to (i) pioneer 
new algorithms for generating QCD configurations on one hand || || |j] and (ii) achieve 
progress in improving HMC (as the method of choice) on the other handlil. 

Meanwhile, the stabilized bi-conjugate gradient algorithm (BiCGstab) has been 
established as an efficient inverter in lattice QCD applications f| |7|, f| §] since it 
requires less iterations and less computing time as compared to the minimal residual 
algorithm or CG on the normal equations: improvements of about 50 % (with respect to 
the conjugate gradient) can be achieved in the regime of small quark masses. Numerical 
studies in Refs. || [7|, [T(| indicate that further progress is now to be expected through 
preconditioning. 

In the past, various approaches for preconditioning the Dirac matrix M have been 
taken: 



1. The first attempt started from the perception that in the weak coupling limit 
the quark propagator is diagonal in momentum space and hence Fourier trans- 
formation would act as a useful preconditioner. Such 'Fourier acceleration' ||11|| , 



however, needs smoothness of fields and hence gauge fixing. This handicaps the 
computation with a substantial overhead which at the end of the day eats up 
most of the previous gain. 

2. The state-of-the-art! preconditioning approach in lattice QCd! rests upon the 



1 It appears promising to find that this complementary strategy of tackling the fermion problem in 
numerical field theory joins both physicists and applied mathematicians. 

2 Non-Krylov-subspace based methods like multigrid as of today did not achieve a level of maturity 
to be competitive with inverters like CG in lattice gauge theory. This is due to the disordered nature 
of the gluonic field configurations that act as fluctuating coefficients within the discretized differential 
operator stencils. 

3 The transformation of M from natural into odd-even order might be looked upon as a particular 
SSOR preconditioning, see Section ||[ 
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odd-even decomposition of the matrix M |L2| that yields an efficiency gain by 
a factor of 2 - 3 when inverting M in the context of actual fermionic force and 
quenched propagator computations |L3j, ||. It is crucial that this second precon- 
ditioning approach lends itself easily to parallelization. 



3. A third and promising method has been evaluated by Oyanagi in the 80s (Tj . 
He used the standard incomplete LU (ILU) factorization of the matrix M based 
on the natural, globally lexicographic ordering of the lattice points. For Wilson 
fermions with Wilson parameter r = 1, the ILU preconditioner is identical to the 
SSOR preconditioner with respect to that ordering. 

Ref. found ILU preconditioning (for the conjugate residual method) to out- 
perform the odd-even decomposition on vector machines, achieving a large gain 
factor (in terms of iteration numbers) over conventional unpreconditioned meth- 
ods. As it stands, Oyanagi's method works satisfactorily on vector machines. 
However, on local memory or grid-oriented parallel computers, this preconditioner 
can hardly be implemented efficiently, since parallelism can only be achieved by 
working concurrently on lattice points lying on the same diagonal hyper-plane. 
Hockney Ul5j [16|1 reports that the run times on several parallel and vector com- 
puters actually degrade as compared to the odd-even preconditioning. 

Using the so-called 'Eisenstat-trick' (see Section |2|), the ILU preconditioning can 

5l@. 



be implemented in a more efficient manner than in Jl4], 15] 



In this paper, which is based on the results in [[17]], we intend to open the stage for use of 
general parallel SSOR preconditioning techniques in lattice QCD based on appropriate 
orderings of the lattice points. Our approach may be regarded as a generalization of 



the odd-even (or red-black) ordering to a multiple color layout |1| or, alternatively, as 
a localization of the globally lexicographic ordering. The parallel implementation will 
be carried out in combining two steps: 



1. Partition the whole lattice into equally shaped sublattices and introduce a lexi- 
cographic ordering on each sublattice. 

2. During the SSOR preconditioning step, in parallel sweep through all sublattices 
in lexicographic order. 

In Section || we discuss the SSOR preconditioner in detail and we give efficient im- 
plementations for the SSOR preconditioned BiCGstab and MR methods involving the 
Eisenstat-trick. Section || explains how different orderings of the lattice points lead 

incorporating the Eisenstat trick into the Oyanagi ILU preconditioner could thus lead to a revision 
of some of the conclusions from |a, so that a more detailed numerical study is advisable. 
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to variants of the SSOR preconditioning procedure. In particular, this section intro- 
duces the locally lexicographic ordering through which we obtain the improvements 
over odd-even preconditioning. 

In Section |], the parallel implementation in conjunction with sub-blocking is discussed. 

Section |5| presents the behavior of our new preconditioning method for Hybrid Monte 
Carlo applications and its performance on APEIOO/Quadrics parallel computers. On 
lattices of size 8 3 x 16 we test the efficiency of the method numerically where we 
find improvement factors of about 2 in terms of iteration numbers compared to the 
standard odd-even method. We discuss the effects of granularity and further report 
about first experiences with the preconditioner in a large scale simulation using a 512- 
node APEIOO/Quadrics on a 24 3 x 40 lattice, in the chirally sensible region. 



2 SSOR Preconditioning 

When preconditioning ([!]), we take two non-singular matrices V\ and Vi which act as 
a left and a right preconditioner, respectively, i.e. we consider the new system 

V^MV^x = 4>, where = VfV, x = V 2 x. (2) 

(In the following, preconditioned quantities will be denoted by tildes.) We could now 
apply BiCGstab, as given in Algorithm |I| for the original system, (or any other Krylov 
subspace method) directly to (0), replacing each occurrence of M and (f> by V{~ MV 2 ~ X 
and (p, resp. However, this would yield the preconditioned iterates x k together with pre- 
conditioned residuals. Therefore, one usually reformulates the algorithm incorporating 
an implicit back-transformation to the unpreconditioned quantities. The resulting algo- 
rithm is similar to Algorithm [I], requiring two additional systems with matrix V = V\\f 2 
and two systems with matrix V\ to be solved in each iterative step (see 0). 

The purpose of preconditioning is to reduce the number of iterations and the computing 
time necessary to achieve a given accuracy. This means that V has to be a sufficiently 
good approximation to the inverse of M (thus decreasing the number of iterations) while 
solving systems with V, V\ should be sufficiently cheap (since these solves represent the 
overhead of the preconditioned upon the basic method). 

In the present work we are only interested in the SSOR (or, more precisely, symmet- 
ric Gaufi-Seidel) preconditioner. Consider the decomposition of M into its diagonal, 
strictly lower and strictly upper triangular parts, i.e. 

M = 1 -L-U 
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{ initialization } 

choose xq, set r = <fi — Mx$, f = r 
set p = pi = a = uj = 1 
set vq = po = 
{ iteration } 
for i= 1,2, ... 

Pi = 

7i = (Pi/Pi-i)(ai-iM-i) 

Pi = 7"j_i + 7i(pi_l - Wi-iUi-i) 

Vi = Mpi 

Si = n-i - otiVi 
U = Msi 

Xi = Xi-i + UiSi + 

^2 ^iti 



Algorithm 1: BiCGstab method 



with L, U strictly lower and upper triangular matrices, respectively. Then the SSOR 
preconditioner is given by 

V X = I-L, V 2 = I-U. (3) 



For the SSOR preconditioner we have V\ + Vi — M — I. This important relation can be 
exploited through the 'Eisenstat-trick' fl9| , because we now have V^MVz 1 = V 2 l + 
Vi X (I — V 2 _1 ), so that the matrix vector product w = l / 1 _1 MV r 2 _1 r can economically 
be computed via 

v = V 2 ~ 1 r, u = V 1 _1 (r — v ), w = v + u. 

In this manner we get the standard algorithmic formulation of the SSOR preconditioned 
BiCGstab method, stated as Algorithm 

The Eisenstat trick is not restricted to the BiCGstab method but can be applied in 
any Krylov subspace method. As another example, we state the SSOR preconditioned 
minimal residual (MR) method as Algorithm |3]. 

Note that these algorithms use the preconditioned residuals fj which are related to the 
unpreconditioned residuals r j = <p — Mxi via 

n = (I - L)fi. 

To save computational costs, a stopping criterion for Algorithm |2| will usually be based 
on fj. Upon successful stopping, one can then compute r, and test for convergence 
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{ initialization } 

choose xo, set r = — Mx° 

solve (J — L)f = r to get f { forward solve } 

r = f 

set p = pi = a = lu = 1 
set vq = po = 
{ iteration } 

fori = 1,2,... 

~t~ 
Pi = r rj_i 

7* = (Pi/Pi-i)(<*i-i/wi-i) 

Pi = fi-i + 7i(pi-l - Ui-iVi-!) 

solve (/ — C/)zj = pi to get Zj { backward solve } 

solve (J — L)wi = pi — Zi to get Wi { forward solve } 

Vi = Zi + Wi 

a i = Pi/ r O v i 

Si = fi-i - CXiVi 

solve (I — U)yi = Si to get yi { backward solve } 

solve (I — L)ui = Si — yi to get Ui { forward solve } 

U = yi + uj 

UJi = t\si/t\ti 

%i = Xi-i +uj { yi + a { Zi 

Ti Si UJiti 

Algorithm 2: SSOR preconditioned BiCGstab 
using If the solution is not yet accurate enough, one continues the iteration with a 

{ initialization } 

choose xo, set r = — Mx° 

solve (I — L)r = r to get f { forward solve } 

{ iteration } 

for % — 0, 1, . . . 

solve (I — U)iVi = fi to get Wi { backward solve } 
solve (I — L)vi = fi — Wi to get Vi { forward solve } 

pi = Wi + Vi 

«< = r\pi/p\pi 

X i+ i = Xi + OtiWi 

f i+ i = fi- atipi 



Algorithm 3: SSOR preconditioned minimal residual 
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stronger stopping criterion. In our numerical experiments to be reported in Section [5] 
it turned out that such additional steps were never necessary. 

Also note that multiplications with M are completely absent in this formulation, the 
only matrix operations being solves with each of the matrices I — L and I — U. Since 
these matrices are triangular, the solves can be done directly via forward or backward 
substitution, respectively. For example, denoting the components of a vector y by (y)i 
and the entries of L by (L)ij and similarly for U, the forward solve (I — L)y = p and 
backward solve (I — U)z — p become simply 

for i — 1, . . . , n for i — n, . . . , 1 

i— 1 n 

{y)i = (p)i + Y,( L Myh (*)* = (p)i + E 

3=1 i=*+l 

forward solve backward solve 



Due to the sparsity pattern of M, most of the entries in L and U are zero so that only 
a few j actually contribute to the sums over j. We will come back to this point in more 
detail in the next section. 

It is important to note that for i fixed, the number of non-zero entries of L and U 
involved when updating y^ in the forward together with Zi in the backward solve is 
just the number of non-zero entries of the i-th row of the matrix M. Therefore, in 
terms of computational cost, a forward followed by a backward solve is quite precisely 
as expensive as a multiplication with M (which is required in the unpreconditioned 
method). Hence, due to the Eisenstat-trick, in terms of matrix vector operations, each 
step of the SSOR preconditioned BiCGstab (or MR) method requires the same work 
as in the unpreconditioned method. 

We finally note that the 'classical' symmetric GauB-Seidel iteration 

(/ - L)x k+l/2 = Ux k + 0, (/ - U)x k+1 = Lx k+1/2 + 0, k = 0, 1, . . . 

represents a sequence of alternating forward and backward solves. This explains the 
terminology SSOR preconditioner where SSOR stands for symmetric successive over- 



relaxation, the over-relaxed variant of the symmetric GauB-Seidel method. See |20 
e.g. 
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3 Orderings 



When writing down equation ([I]) with Wilson fermion matrix mB 

4 

M X ,y = 5 Xt y-K^(l-J^)U^(x)S Xt y^ + (l + J^)Ul(X-ll)6 X; y +fJl 

4 

= °~x,y K m x,y ^x,y—fi 4" m x,y ^x,y+pi- (4) 
H=l 

we have the freedom to choose any ordering scheme for the lattice points x. Different 
orderings yield different matrices M, however, which are permutationally similar to 
each other, i.e. one matrix can be retrieved from the other via the transformation 
M — > P'MP with a permutation matrix P. In general, P^LP and P^UP will not 
represent the strictly lower or strictly upper triangular part of P^MP. Consequently, 
the SSOR preconditioned matrix for P^MP will not be permutationally similar to 
that of M, so that the quality of the SSOR preconditioner will usually depend on 
the ordering scheme chosen. One purpose of this paper is to explain that the odd- 
even preconditioning and the Oyanagi preconditioning can both be regarded as the 
SSOR preconditioning belonging to particular orderings of the grid points!. We will 
then introduce a new 'locally lexicographic' ordering which is adapted for parallel 
computation and for which the SSOR preconditioned system can be solved more rapidly 
than with odd-even preconditioning with respect to both, the number of iterations as 
well as actual run time on particular parallel computers. 

Consider an arbitrary numbering (ordering) of the lattice points. For a given grid 
point x, the corresponding row in the matrix L or U contains exactly the coupling 
coefficients of those nearest neighbors of x which have been numbered before or after 
x, resp. Therefore, a generic formulation of the forward solve for this ordering is given 
by Algorithm [|. The backward solves are done similarly, now running through the grid 
points in reverse order and taking those grid points x ± \i which were numbered after 
(instead of before) x. Due to this analogy, we will restrict our discussion to the forward 
solves in the sequel. 



Odd-even ordering. As a first specific example, let us consider the familiar odd- 
even ordering where all odd lattice points are numbered before the even ones. From 
our generic formulation in Algorithm ^ we then get Algorithm [5] for the forward solve. 

In traditional QCD computations, the odd-even preconditioning is not implemented by 

5 The 4 matrices 7 M are 4x4 Dirac matrices, and the V M are 3x3 SU(3) matrices that represent 
the gluonic degrees of freedom on the lattice. 

6 It is shown in |[7) that the SSOR preconditioner for the Wilson fermion matrix (with Wilson 
parameter r = 1) is identical to the ILU preconditioner for any ordering of the lattice points 
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for all grid points x in the given order 
{ update y x } 

Vx = Px 

for /i — 1, . . . , 4 

if x — fi was numbered before x then 

for n = 1, . . . , 4 

if x + /i was numbered before x then 

Ux = Ux + k ~ m x ,x+^yx+ii 



Algorithm 4: Generic forward solve 



for all odd grid points x 

Vx = Px 

for all even grid points x 
{ update y x } 

(4 4 



Algorithm 5: Odd-even forward solve 



using the above formulation of the forward (and backward) solve in Algorithm 0. This 
is due to the fact that — very exceptionally — for this particular ordering the inverses of 
I — L and I — U can be determined directly: With the odd-even ordering, the matrix 
M has the form 

so that 

' - ^ = ( ; ) ^ ^ - i)- 1 = 

and 

/ - U = f J ~ K ^ oe ] with (/ - f/)- 1 = ( J ^ oe 




Hence 



(I - L)- l M(I - U)- 1 = I I 



where 1 — n 2 D eo D oe is called the matrix of the odd-even reduced system. Very excep- 
tionally again, in this particular case, the preconditioned matrix (/ — L)~ l M(I — U)^ 1 
thus has such a simple structure that it is possible to apply Algorithm 1 directly, using 
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the preconditioned matrix explicitly. Since the matrix (/ — L) -1 M(J — U)^ 1 is 2 x 2 
block diagonal with the first diagonal block being the identity, the computations for 
the second diagonal block are completely decoupled from those of the first block and 
they are identical to those which are performed when applying the algorithm for the 
odd-even reduced system directly. This is precisely what is done in traditional odd- 
even preconditioning. So traditional odd-even preconditioning is equivalent to SSOR 
preconditioning for the odd-even ordered system. 

So far, odd-even preconditioning is generally considered as the only successful precon- 
ditioner in a parallel computing environment. For typical, realistic configurations, it 
gains a factor 2-3 in the numbers of iterations and also in computing time as compared 
to solving the unpreconditioned system. 



Globally lexicographic ordering. Assume now that M is given with respect to 
the natural (lexicographic) ordering of the lattice points. This means that grid point 
x = {iii i'i-, ^4) is numbered before x' = (i[, i' 2 , i' 3 , z 4 ) if and only if (z 4 < i 4 ) or {i 4 = i' 4 
and i 3 < i' 3 ) or (z 4 = i 4 , i 3 = i' 3 and i 2 < i' 2 ) or (z 4 = i'^ij, = i 3 ,i2 = i'2 an d i\ < i[)- 
The corresponding forward solve is given as Algorithm |6]. 



for i 4 = 1, . . . , n 4 
for i 3 = l,...,n 3 
for i 2 — 1, . . . ,n 2 
for %\ — 1, . . . , rii 

x := «2, *3, «4) 

{ update y x } 

Vx = Px 

for /J, — 1, . . . , 4 

if > 1 then y x =y x + K m+^y^ 



Algorithm 6: Lexicographic forward solve 

The SSOR preconditioning for the lexicographic ordering, applied to the MR and other 
conjugate residual methods, was considered by Oyanagi |14]]. He showed that it yields 
a further improvement over odd-even preconditioning as far as the number of iterations 
is concerned. However, its parallel implementation is more difficult and less efficient, 
since only grid points lying on the same diagonal hyper-plane can be worked upon 
in parallel in the forward and backward solves. Hockney |16j reports that on the 592 
processor ACPM APS at Fermi-Lab, the run time (without the Eisenstat trick) actually 



degrades as compared to odd-even preconditioning. See also [R 
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Figure 1: Locally lexicographic ordering and forward solve in 2 dimensions 

Locally lexicographic ordering. Unlike the lexicographical and the odd-even or- 
dering, the ordering we propose now is adapted to the parallel computer used to solve 
equation (|l|). We assume that the processors of the parallel computer are connected 
as a pi x p 2 x p 3 x p 4 4-dimensional grid. Note that this includes lower dimensional 
grids by setting some of the Pi to 1. The total number of processors is p = piP2PzPi- 
The space-time lattice can be matched to the processor grid in an obvious natural 
manner, producing a local lattice of size n l ° c x n^ 00 x n£ c x n l £ c with n l ° c = rii/pi on 
each processor. Here, for simplicity, we assume that each pi divides rij and that we 
have n l ° c > 2 for i = 1, . . . , 4. 

Let us partition the whole lattice into n loc = n l ° c n l £ c n l 3 C n l £ c groups. Each group corre- 
sponds to a fixed position of the local grid and contains all grid points appearing at this 
position within their respective local grid. Associating a color with each of the groups, 
we can interpret this process as a coloring of the lattice points. For the 2-dimensional 
case, such a coloring is depicted in Figure 1 (with n l ° c = n 1 ™ = 4), where the 16 colors 
are denoted by the letters a - q. 

We now consider an ordering where all points of color b are ordered after all those of 
color a if on the local grids the position belonging to color a is lexicographically less 
than that of color b. In Figure 1, this corresponds to the alphabetic ordering of the 
colors a - q. Such an ordering is termed locally lexicographic. From now on we will use 
the prefix l ll- in expressions like '//-first', '//-ordering' to make clear that they refer to 
the locally lexicographic ordering. 

Since we assumed n l ° c > 2 for all i, all nearest neighbors of a given grid point have 
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colors different from that point. This implies that when performing the forward and 
backward solves in Algorithm 0, grid points having the same color can be worked upon 
in parallel, thus yielding an optimal parallelism of p, the number of processors. A 
formulation of the //-forward solve is given as Algorithm [j]. Here, we use as a 
symbol for '//-less than'. 



for all colors in lexicographic order 
for all processors 

x := grid point of that color on that processor 
{ update y x } 




Vx=Px + k\ m tx~^Vx-^ + m x,x+^x +l , 

^,x+n< u x 



Algorithm 7: //-forward solve 

For grid points lying in the 'interior' of each local grid, we have x — /i <u x <u x + /j, 
for ji = 1, . . . , 4. The update thus becomes 

Vx = Px + « m tp-v3l*-l\ , 

whereas on the 'local boundaries' we will have between (for the //-first point) and 8 
(for the //-last point) summands to add to p x . The second sum in the update formula 
then contains those grid points which belong to neighboring processors. The arrows in 
Figure 1 illustrate this situation for the 2-dimensional analogue, where 2 grid points 
contribute to the update at locally interior points (colors /, g, k, /) and to 4 on the 
local boundaries. (An arrow from point x ± // to x means that x ± /j, appears in the 
sum for updating point x.) 

With the lexicographic ordering on the whole lattice, a forward solve followed by a 
backward solve has the effect that the information at any grid point will have spread 
out to all other grid points. This yields a heuristic justification of why this ordering 
is superior (as far as convergence properties are concerned) to the odd-even order- 
ing, where the information is passed to the (second-) nearest neighbors only. The 
//-ordering can be regarded as a compromise between these two extreme cases, spread- 
ing information to all points within a local grid. The //-forward and //-backward solves 
parallelize efficiently and better than with the lexicographic ordering on the whole 
grid. The parallelism achieved is p and thus less than with the odd-even ordering, but 
it is optimal since we have p processors. If we change the number of processors, the 
//-ordering, and consequently the properties of the corresponding SSOR preconditioner 
will change, too. Heuristically, we expect the convergence properties to degrade as 
the size of the local grid becomes smaller but to always remain better than with the 
odd-even preconditioner (as long as n- oc > 2 for i = 1, . . . , 4). 
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4 Parallel Implementation 



In this section we give a detailed discussion on the realization of the //-forward solve 
on a parallel computer. In order to render all necessary communication transparent we 
assume a model parallel computer where communication is done via explicit message 
passing between processors. 

As in the previous section, we assume that the processors are connected as a 4- 
dimensional (or lower dimensional) grid and that the space-time lattice is mapped 
onto that grid in the obvious manner, yielding a local subgrid on each processor. 
The local coordinates of a grid point x within a subgrid are denoted by the 4-tuple 
x = (it, h)- For processor 7r of the processor grid we denote its neighbors in 
dimension fi by 7r ± /i. Incorporating all necessary data transmission into Algorithm [7] 
via message passing statements — sending messages as soon as possible while receiving 
them as late as possible — we end up with Algorithm [8] which formulates the //-forward 
solve for processor tt. 

for z 4 = 1, . . . , n l £ c 
for i 3 — 1, . . . , n l £ c 
for 2*2 = 1, ... , n 2 0C 
for %i = 1, . . . , n l ° c 
x := (ii,i 2 ,H,u) 
{ update y x } 

Vx = Vx 

for n = 1, . . . , 4 

if 2 M > 1 then y x = y x + Km+ X _^y x _^ 
for /i = 1, . . . , 4 
if = 1 then 

send y x to processor n — jj, 
if in = n l ° c then 

receive y x +n from processor n + fi 
Vx — Vx + Km x , x + fJ .y x +t j - 

Algorithm 8: //-forward solve on processor 7r 

Note that if processor 7r issues a 'send' to transfer y x to processor n — ft because of 
i^ = 1, the corresponding 'receive' (for i^ = n l ° c ) in processor n — fi reads this data as 
y x i with x' — x + jj,. If jj, — 1, exactly n l ° c — 2 updates are done between a matching 
send/receive pair. If fi = 2 this number increases to n l ° c {n l £ c — 2), for [i = 3 it becomes 
n ioc n ioc^ n ioc _ 2) anc [ f or ^ = 4 it is n l ° c n l 2 C n l ^ c (n l f c — 2). Therefore, if messages can 
be handled in parallel to the computation, the message passing can be hidden behind 
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Figure 2: 11 x oe ordering and forward solve, 2 dimensions 

the computation so that the communication overhead will be small. Also note that if 
— 1 the send/receive pairs for direction \i have to be dropped in Algorithm [8]. 

Algorithm [| has to communicate the local boundary values all one by one. This is 
inherent in the //-ordering and can become a disadvantage on machines which have a 
large start-up time for communication, irrespective of the message length. In these 
cases, it is preferable to issue only a small number of (large) messages. Then the 
following outer product' of the //-ordering with an odd-even ordering can be used: 
Divide each local grid into two halves G±, G2 by bisecting perpendicularly to direction 
4 (say), where n l £ c is supposed to be a multiple of 2. Identify odd and even processors 
on the processor grid, neglecting the 4-th coordinate, i.e., processor n = (ji, J2, J3, J4) 
is odd if and only if j\ + j 2 + J3 is odd. On each even processor, we first take a locally 
lexicographic ordering on G\ and then on G2, on odd processors we reverse the role 
of G± and Gi- The forward solve with respect to this ordering consists of two half 
steps. First, the odd processors update G\ while the even processors work on G 2 - No 
communication is necessary there. Then the updated values on the local boundaries 
have to be communicated (requiring a total of 7 large messages, one for direction 4 
and two for each other direction). Now the second half step can be performed, where 
odd processors update G 2 and even processors update G\. An illustration of this 
oe x //-ordering for the 2-dimensional case is given in Figure 0. 

Implementing the ZZ-SSOR preconditioning requires to distinguish between grid sites 
along a given direction. Assume in a forward solve that a grid point is located at 
the 'left' boundary of the local lattice. This site is not updated with data from the 
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left neighbor, and this fact has to be inquired by a conditional statement or a case 
statement. The second case is that a grid site is located at a 'right' boundary where the 
right neighbor has to be communicated from the neighbor sublattice (in this situation 
the neighbor processor). The third possibility is that the site is not a boundary grid 
point. This leads to 81 different cases a site can be associated with in four dimensions. 



5 Results 



Our numerical tests of the locally lexicographic SSOR preconditioner were performed 
on APEIOO/Quadrics machines, a SIMD parallel architecture optimized for fast floating 
point arithmetic on block data-structures like 3x3 SU(3) matrices or any other data 
structure that can be blocked similarly. Arithmetic on the Quadrics is compliant with 
the IEEE 754 single precision standard. We had access to a 32-node Quadrics Q4 
at Wuppertal University and a 512-node Quadrics QH4 at the computer center of 
ENEA/Casaccia Rome. 

In order to speed up the performance on the Quadrics it is favorable (as it is the case 
on most high speed parallel computers) to partition the code into code blocks i. e. 
sections that can be computed entirely from the registers without recourse to the local 
or remote memory, thus using the pipelining features of the processors efficiently This 
requires to modify Algorithm || by moving all if-statements out of the four-fold nested 
loop over z 1; . . . ,i 4 resulting in a sequence of 81 nested loops, one for each possible 
combination of the three cases i M = 1, 1 < < n l ° c , = n l ° c for /x = 1, . . . , 4. 



Evidence of improvement. In Fig. fj first evidence of improvement by //-SSOR 
preconditioning of BiCGstab is presented. Shown are three different iteration numbers 
of a valence mass "trajectory" for a given thermalized full-QCD configuration at (3 = 5.6 
and K aoa = 0.156 corresponding to an intermediate pion mass. The standard BiCGstab 
solution of ([[]) is compared with the odd-even preconditioned method and the new 
locally lexicographic ordering scheme. The measurements are taken on a lattice of size 
8 3 x 16, hence the computational problem being of granularity g = N/p = 256 on the 
Quadrics Q4 with 32 nodes. In this exploratory implementation, as discussed in the 
previous sections, the sublattice administrated by a given processor is used as the local 
lattice in the //-SSOR scheme. At this place, the convergence stopping criteria, 

. . r \\Mx-<p\\ 2 
stop iteration it — — < e, (6) 

are applied on the three different residuals used in the three variant algorithms, re- 
spectively, with equal numerical value e = 10~ 8 . We verified that the 'true' residuals 
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Mx — 0, calculated explicitly from the solution x, eventually were smaller for the pre- 
conditioned variants of the BiCGstab solver. The gain in iterations is more than a 
factor of 2 for K valcncc = K sca as is the case within the entire valence k range considered. 
Compared to the unpreconditioned BiCGstab the gain is even a factor of 4. 

The improvement in terms of cpu-time, on the Q4, is smaller than the gain in iteration 
numbers since the //-SSOR preconditioning leads to certain breaks in code blocks due 
to the 81 different cases mentioned above. Thus, pipelined arithmetic is not exploited 
as well as in the other cases. For the given choice of parameters we find an improvement 
of about a factor of about 1.5 compared to standard oe-preconditioning throughout the 
valence mass range considered. 

The convergence behavior of the three methods is demonstrated in Fig. |] on a typical 
configuration at the same parameter values for K sca = k VSj1 = 0.156. In addition to 
the acceleration, the convergence behavior turns out to be much smoother for the //- 
preconditioned BiCGstab than in the unpreconditioned case. We have plotted both the 
accumulated and the 'true' residual. The latter approaches a horizontal line for ||r|| 2 ~ 
10 -6 which indicates the limits of single precision accuracy. For //-preconditioning, 
however, the final residual that can be achieved appears to be somewhat more accurate 
than for pure BiCGstab. 




Figure 3: Iteration numbers of pure BiCGstab •, the oe- + and the //-preconditioned 
version □ for a series of valence masses on a 8 3 x 16 lattice. The second plot shows the 
cpu-time needed on a 32-node Quadrics Q4 at granularity g = 256. 
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Figure 4: Convergence behavior of BiCGstab and its oe- and //-preconditioned variants 
on an 8 3 x 16 lattice. 

Local lattice size dependency. It is important to assess the dependency of the 
improvement due to ZZ-SSOR preconditioning on the size of the local lattice. As our 
Q4 is equipped with a fixed number of 32 processors, we emulated larger numbers of 
local lattices by sub-blocking the system on each processor. The smallest local lattice 
size we can investigate is a 2 4 lattice, apart from the unpreconditioned case (which we 
treat as having local lattice size 1) and the oe-realization of the solution of ([]]) (local 
lattice size 2). 

The dependency of the number of iterations on the local lattice size is depicted in Fig. ^| 
on an equally sized lattice as used before and at equal physical parameters but with 
K sca = 0.1575. As expected, smaller local lattices lead to a less efficient preconditioning. 
However, the number of iterations at first increases only slightly if we move from a local 
lattice size of 256 to 16. This behavior might of course be dependent on K aoa , hence the 
crossover to the oe-result (the diamond in Fig. ||) might occur at a larger local lattice 
size for /t sca closer to k c . 

Speeding up QCD computations. Next we discuss improvements due to the ZZ- 
SSOR preconditioning implemented in an actual Hybrid Monte Carlo simulation of full 
QCD with Wilson fermions. We depict in Fig. |6|a a time series of iteration numbers 
that are each averages over a molecular dynamics trajectory of a length of 125 steps. 
The measurements are taken on a system of size 24 3 x 40 at (3 = 5.6 and K sca = 0.1575. 
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Figure 5: Dependence of iteration numbers on the //-lattice size, the lattice volume is 
8 3 x 16. 



The first part of the curve has been generated using standard oe-preconditioning for the 
BiCGstab inverter. The sudden step in iterations is due to the fact that we switched 
the preconditioner to ZZ-SSOR at this location. The iteration numbers show a quite 
stable value in both branches of the curve and hence nicely demonstrate the stability of 
the preconditioning on different background field configurations within the canonical 
ensemble. In the actual large scale simulation, the ZZ-SSOR preconditioning scheme 
indeed gives an improvement of nearly a factor of 2 as to the iteration numbers for a 



quite light pion mass pi 



We remark that we have verified in our simulation that the educated guess procedure 
as advocated by Brower et al. leads to an improvement that is additive to the gain 
as achieved by the use of ZZ-SSOR preconditioning. 

The time behavior of the system on the 512-node APEIOO/Quadrics QH4 is depicted 
in Fig. ^b. We remark that on this machine topology with 8x8x8 nodes on a 3- 
dimensional grid, we arrive at a local grid size of 3 x 3 x 24 x 5 for the 24 3 x 40 lattice. 
On a strict SIMD architecture without local addressing, as is the case on the Quadrics, 
oe-schemes are quite tricky to implement and lead to a substantial loss in performance 
in the range of 30 %. Therefore ZZ-SSOR can help a lot to overcome the SIMD odd-even 
limitations. The total improvement as to cpu-time is about a factor of 1.7. 

Details on our implementation and the experience with ZZ-SSOR preconditioning in the 



context of large scale HMC simulations in full QCD will be published elsewhere [23 . 
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Figure 6: Section of the time series of iteration numbers together with cpu-times re- 
quired on a QH4 from a full QCD HMC simulation on a 24 3 x 40 lattice. The graphics 
shows the location where we switched from the oe-representation of the fermionic ma- 
trix M to the //-scheme. 



6 Conclusion and Outlook 



We have presented a new local grid point ordering scheme that allows to carry through 
efficient preconditioning of Krylov subspace solvers like BiCGstab in the context of 
computations of Greens functions for lattice quantum chromodynamics with Wilson 
fermions on parallel computers. 

Using the Eisenstat-trick we can avoid the explicit multiplication by the fermionic 
matrix M and we remain with a forward and a backward solve to invert the upper and 
lower triangular parts of M. 

In actual Hybrid Monte Carlo simulations with Wilson fermions at small pion masses 
and on large lattices, we verify an improvement of a factor of 2 in BiCGstab concerning 
iteration numbers as compared to the state-of-the-art odd-even preconditioned scheme. 
In this manner, our code is thus speeded up on a 512-node APEIOO/Quadrics by a factor 
of about 1.7 at (3 = 5.6 and /t soa = 0.1575. 

We are confident that the //-SSOR scheme will help to drive future simulations with 
dynamical Wilson fermions deeper into the chiral region dominated by a small pion 
mass. 
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